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Abstract 

We present a mathematical model, and the corresponding mathematical analysis, 
that justifies and quantifies the use of principal component analysis of biallelic 
genetic marker data for a set of individuals to detect the number of subpopula- 
tions represented in the data. We indicate that the power of the technique relies 
more on the number of individuals genotyped than on the number of markers. 
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1. Introduction 



Principal component analysis (PCA) has been a powerful and efficient method 
for analyzing large datasets in population genetics since its early applications 



by Cavalli-Sforza and others (Menozzi et al. 1978, Cavalli-Sforza et al. 1993 



1994). In particular, PCA of single nucleotide polymorphism (SNP) genotype 



data can be used to illuminate population structure (Nelson et al. , 2008), provide 
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insights into human history and admixture (Novembre et al. 2008; McVean 



2009), and help to estimate the number of distinct subpopulations within a sample 



(Patterson et al. 2006). PC A can also be applied to correct for population 
structure within a sample of individuals, to prevent spurious results in conducting 
medical genetic studies such as genome- wide association studies (GWAS), which 



seek to find genes underlying diseases or traits (Price et al. 2006; Zhu et al. 



2002). 



In this paper, we provide additional mathematical confirmation for the use of 
PC A in estimating the number of subpopulations within a sample. In a related 



result (Patterson et al. 2006, Theorem 3) that motivated this research, the au- 
thors analyze the theoretical centered covariance matrix for a single marker as the 
number of individuals increases without bound. Here we analyze a mathemati- 
cally more complicated object: the sample covariance matrix based on multiple 
markers. In current practice, the sample covariance matrix is often centered, 
and the data rows are often further normalized. In contrast to previous work, 
our results describe behavior of the eigenvalues of the sample covariance matrix 
without centering or normalization, taking into account both the number of in- 
dividuals and the number of markers. The raw unprocessed covariance matrix 
is more amenable to mathematical analysis, and the singular values of such raw 
data exhibit quantifiable properties that can be used directly to determine the 
number of populations in the data in an almost deterministic fashion, at least 
when the number of individuals in the study is sufficiently large. 

We show that for large data sets of individuals from K well-differentiated 
subpopulations, with overwhelming probability the un-centered sample covari- 
ance matrix has K large eigenvalues. (The technical meaning of "well differen- 



tiated subpopulations" is that matrix Q, which we later define in equation (4.8) 
from the pairwise moments of pairwise site spectra, is non-singular.) These large 
eigenvalues, which indicate the presence of population structure, are greater by 
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a factor proportional to the number of individuals than the remaining smaller 
eigenvalues. The large eigenvalues arise from the mixed moments of the pair- 
wise site frequency spectra, while the small eigenvalues are attributed to random 
differences between the individuals in the sample. In practice, with finite popula- 
tions, we can detect only the eigenvalues that are well separated from zero where 



the cutoff described in equation (2.1 ) is beyond the boundary of the range of the 
many smaller eigenvalues, which we will refer to as the "bulk" of the eigenvalues. 

We note that the eigenvectors of a sample covariance matrix are also interest- 
ing, but notoriously difficult to analyze mathematically, so this paper is devoted 
solely to understanding the eigenvalues. 

2. Methods 



In setting up the mathematical model, we begin as in Patterson et al. (2006). 
We consider unrelated diploid individuals with independent biallelic markers. 
We assume that the data for our biallelic markers are recorded in a large M x N 
rectangular array C with rows labeled by individuals and columns labeled by 
polymorphic markers. The entries Cj j are the number of variant alleles for marker 
j, individual i, that take values 0, 1 or 2. We assume that we have data for 
M individuals from K populations, and that we have M r individuals from the 
subpopulation labeled r so that M = Mi + M<i + • • • + Mk- Often, neither K nor 
Mi, . . . , Mk are known, so we may wish to estimate the value of K, the number of 
subpopulations in the data. If the population sampling information were known, 
namely, that individual i is from subpopulation r, the genotype probabilities 
for marker j, P(Cjj = 0,1,2) would be given by the expected frequencies in 



population r, as in equations (4.1)-(4.3) 



In what follows we deviate from the method of Patterson et al. , since for 



mathematical analysis it is inconvenient to rely on data-dependent statistics (of 
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mean and variance) to center or to normalize the entries of the array. Instead we 
work directly with the eigenvalues of the uncentered sample covariance matrix 
CC. This is a symmetric square matrix of size M, the number of individuals. 
We consider the eigenvalues of CC which we write in decreasing order A\ > 
A 2 > ■ ■ ■ > Am- 

In this paper we formally derive the mathematical proof of an estimator for the 
number of subpopulations based on the magnitude of these eigenvalues. We derive 
an estimate for the number of subpopulations, K, as the number of eigenvalues 
larger than the threshold of 

t' (VM + Vn) 2 = (l + ./M/n) 2 (2.1) 



Equivalently, for the scaled matrix X we later define in equation (4.10) we can 
use the more intuitive threshold: 

* = (2.2) 
2 v ' 

which does not depend on M, N. The parameter F used here, as defined by 



equation (4.4), takes values between and 1. In practice, F captures departures 
from Hardy- Weinberg equilibrium. 

Hence begins our main result, that depending on the value of F, the threshold 
cutoff t for determining the number of large eigenvalues corresponding to pop- 
ulation structure, is between 0.5 and 1. If there are K subpopulations present 
in the data, then as N and M increase without bound (and are subject to cer- 
tain technical conditions), with overwhelming probability the smallest M — K 



eigenvalues of CC are smaller than t' from (2.1). Furthermore, the consecutive 
largest K eigenvalues are typically much larger than the order of 4XjMN, where 
Xj > is the corresponding eigenvalue of the deterministic (hidden) matrix with 



entries given by (4.8) which we describe below. Since we write all eigenvalues 
in decreasing order, our ability to correctly estimate the value of K depends on 
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the magnitude of the smallest eigenvalue Xk in comparison to the number of 
individuals M. 

Here, in evaluating whether an eigenvalue corresponds to population struc- 
ture, we are effectively comparing a constant between 1/2 and 1 (depending on 
the value of F), to a number larger than XkM. Of course, XkM can be made 
arbitrarily large by increasing the number of individuals M, making it possible 
to easily resolve eigenvalues corresponding to population structure separate from 
the bulk. From our mathematical analysis, we show that the number of popula- 
tions is estimated essentially without error when MXk is larger than 1. In view of 
this strong separation, the eigenvalues of CC, can be safely used in exploratory 
data analysis without need for a formal statistical test to assess significance of 
structure when M is large enough. 

A check for the appropriateness of the cutoff is provided by the histogram of 
the eigenvalues - the K largest eigenvalues should be separated from the remain- 
ing eigenvalues, or the bulk. Under the model of clean population substructure, 
the remaining eigenvalues should cluster together into a fairly solid group, as 
these eigenvalues correspond to random differences among individuals. Ideally, 
one expects the shape of the bulk to be a single-mode semi-elliptical mass located 



with sharp boundaries like the Marchenko-Pastur law (Bai and Silverstein, 2010 



Chapter 3). After normalization (4.10), the distribution of the bulk should be 
located to the left of (1 + F)/2. 

The accuracy of this estimator of K depends on the theoretical smallest pop- 
ulation eigenvalue: 

L = , iMNX « , , (2.3) 



>M + y/N 

That is, it depends on the smallest of the theoretical population eigenvalues of 



(3.1), for the rescaled matrix X^r from equation (4.10). This theoretical value 
indicates whether there is likely to be power to detect the full population sub- 
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structure present in the data, if, for example, L > 0.5. Indeed, our simulations 
confirm that our estimate of population structure works very well whenever L is 
larger than 0.5 (when F = 0), or L is larger than 1 (when F = 1). In practice, 



population parameter (the smallest eigenvalue of matrix Q, see equation (4.8)) 
and thereby L, is a hidden parameter that depends on the theoretical hidden pop- 
ulation moments and on the unknown relative proportions C\ , C2 , . . . , Ck of the 
subpopulations present in the data, and cannot be obtained for non-simulated 
datasets. 

2.1. Robust to violations of assumptions 

In our analysis we explore possible violations of our key assumptions, namely, 
independence among markers and stochastic independence of individuals drawn 
from a population. 

Simulations indicate that linkage disequilibrium (LD), the non-random cor- 
relation of nearby markers that violates our independence assumption, does not 
strongly affect our ability to detect population structure. In our view, the thin- 
ning of markers, such as via the LD-pruning implemented in PLINK, is a simple 
yet robust technique that addresses linkage disequilibrium violations of inde- 
pendence assumptions, that works without the need for corrections described in 



(Patterson et al. 2006; Shriner, 2012). Our formulas show that no matter what 
the value of Xk is, one can safely reduce the value of N by thinning, or removing 
nearby markers that are highly correlated, without a significant loss of accuracy. 
For example, to compensate for thinning which would reduce the number of mark- 
ers N = 500 Af to N' = 100M, it is enough to increase M by about 20%. To 
compensate for the potential loss of accuracy due to thinning down to N' = 10M, 
one needs to increase M by about 73%. Furthermore, if Xk is far enough from 
zero so that L is much larger than 0.5, then thinning will have no effect even if 
the number of individuals M is kept unchanged. Our analysis shows that the 
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number of subpopulations K determined by this procedure is quite insensitive to 
the number of markers N, as long as this number is well above the number of 
individuals M. 

A possible violation of the assumption of stochastic independence of individ- 
uals is non-random mating, which results in departures from Hardy- Weinberg 
equilibrium (HWE). We find that that departures from HWE do not signifi- 
cantly reduce the power of PCA for detecting population substructure. In fact, 
to compensate for F = 1 instead of F = it is enough to increase the number 
of individuals M in the study by a factor of 2; however, no such corrections are 
needed if the smallest eigenvalue Xk is separated from zero well enough so that 
L > 0.5. 

Lastly, our application to real genotype data from the International HapMap 
Project (HapMap) suggests that cryptic relationships between the individuals 
in the data may affect the applicability of our method to a larger degree than 
LD. Under a simple substructure scenario, the bulk of eigenvalues should have 
a unimodal elliptical shape similar to the Marchenko-Pastur distribution, easily 
distinguished from large eigenvalues corresponding to substructure. However, as 
we demonstrate in Figure [2j individuals may exhibit cryptic relatedness, or other 
unknown non-random relationships, which result in changes to the distribution of 
the bulk, making it difficult to infer the correct cutoff for substructure. We find 
that pruning for LD does not seem to improve the fit of the bulk to Marchenko- 
Pastur distribution. Instead, exclusion of related individuals improves fit of the 
bulk; hence, we suggest that it is necessary to remove related individuals from 
the sample to improve the resolution of true substructure. 

2. 2. Summary 

Overall, our mathematical analysis confirms empirical evidence that PCA 
is a robust technique for learning about population substructure of a dataset. 
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Contrary to current practice, based on the mathematical theory presented in 
the following sections we recommend using PCA directly on the data matrix C 
without centering or renormalization. With sufficient sample size, there should 
be strong separation between the large eigenvalues corresponding to population 
structure and the remaining bulk of the distribution. We illustrate a proof of 
principle of our approach through simulations and application to human genotype 
data from world-wide populations. 



3. Results and Discussion 

In this section we illustrate the power of our mathematical findings for infer- 
ence of population structure in genetic data. We begin with the simulations for 
a "simple model" where all the hidden parameters can be computed. This allows 



us to analyze sensitivity of the technique to the precise value of L in (2.3). In 
particular, since we are able to compute the hidden parameter L, we can then 
see how well our predictions match theory, and how well powered we are to de- 
tect the known substructure. Then we consider an intermediate stage - we use 



simulations from (Gao et al. 2011) where the true demography is known and 
each individual is a member of one of the populations, but for which the hid- 
den population parameters are not known. The datasets cover several different 
demographic models, with different population split times, trees, and migration 



rates. For more details on each of the models, see reference (Gao et al. 2011). 
Finally, we apply the theory to human genotype data from world-wide popula- 
tions, where we discuss additional challenges due to linkage disequilibrium and 
cryptic relatedness, and where the value of L is not available. 

3.1. Simulations for a simple model 

We show that the theoretical approximations to the largest eigenvalues work 
very well when all the assumptions of mathematical analysis are satisfied. We gen- 
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erate simulations based on a simple model in which we make several assumptions 
that are unrealistic, but allow us to compute important mathematical parameters 
to explore the performance of our method. We assume that the site frequency 
spectra are known for each population. We also know how many individuals 
came from each subpopulation, and that the populations are independent. The 
latter corresponds to a scenario where all populations diverged and stopped in- 
teracting in the distant past. Though unrealistic, such a simplistic model has the 
advantage that all relevant quantities that enter mathematical analysis can be 
computed. In particular, this model allows us to study the effects of choosing a 
small enough number of individuals M and analyze how the failure rate for the 
estimator depends on the value of L, see Table [T] 

In our simulations we use unequal subpopulation samples sizes, drawn with 
proportions c\ = 1/6, C2 = 1/3, C3 = 1/2. The theoretical population propor- 
tion p r {j) at each SNP location for each population was selected from the same 
site frequency spectrum ip{x) = Q.h/y/x. We selected Pi{j) , P2U) , P3U) inde- 
pendently at each location j which corresponds to product joint site frequency 
spectrum (p(x, y, z) = <p(x)<p(y)<p(z) for our K = 3 simulated populations. We 
then simulated independent individual genotypes for the j-th marker of a member 
of the r-th population by choosing independent binomial values (with 2 trials) 
with probability of success p r (j)- We can evaluate how well the mathematical 
description matches the simulated data because we can explicitly compute the 



theoretical matrix of moments (4.5 4.6) and hidden matrix Q defined by (4.8): 



m 



r.s\ 



1/5 1/9 1/9 0.0333 0.0262 0.0321 

1/9 1/5 1/9 , Q = [^/c^Csm^s] = 0.0262 0.0667 0.0454 
1/9 1/9 1/5 0.0321 0.0454 0.1000 

The eigenvalues of the above matrix Q are [Ai, A2, A3] = [0.1467, 0.0355, 0.0178]. 
The theoretical prediction for the observed largest eigenvalues of the normalized 
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sample covariance matrix (4.10) are then 



A, « 3 — 2 . (3.1) 

r M + Vn) 



The actual observed eigenvalues will not match exactly these predictions; the 
purpose of the simulations is to illustrate how far the empirical values for finite 
M, N differ from the values predicted by theory in the limit as M, N tend to in- 



finity. For example, formula (3.1 ) with M = 1200, N = 25000 gives the following 



values: (47.4,11.5,5.7). In a simulation, we obtained the following eigenvalues 



for the normalized matrix (4.10): 



(Ai, A 2 , A 3 , A 4 , A 5 , • • • ) = (48.2, 11.5, 5.8, 0.27, 0.26, . . . ) 

We see that the threshold of 0.5 separates clearly the K = 3 largest eigenvalues, 
set in boldface, from the bulk. 

We remark that there are two sources of error: the approximation Bjy ~ Q 
that appears in Lemma [2] and then the approximation due to randomness within 



each population that is still present in (4.12) for finite numbers of SNPs, N . It is 



therefore encouraging to see that the predicted values for the eigenvalues match 
well with the empirical eigenvalues in the simulations for realistic values of M = 
1,200, N = 25,000 as well was for much smaller values of M, see Table [TJ or 
even for M = 12. When M = 24 and N = 100, we get L = 0.77 and we are 
successful in determining correct value of K = 3 the vast majority of the time: 
in 100,000 simulations, K is underestimated in only a minuscule 0.005 % of the 
runs and never overestimated. Reducing further the number of individuals to 
M = 12, leads to L = 0.471 and in this case, as expected, the rate at which our 
estimate of K fails increases. But the decrease in accuracy is not dramatic, and 
an underestimate of K = 2 occurs only in about 6.5% of runs. For reasonably 
large M and N, our power is quite high, and the false positive error rate was 
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strictly zero for all scenarios, since an overestimate of K did not occur in any of 
the replicates under any scenario. 



Table 1 False negative probability as a function of L (based on 100,000 simula- 
tions) 

M N L P(K < 3) 

individuals SNPs False negative rate 

1200 25000 5.747 

48 100 1.926 

24 100 0.770 0.00005 

12 100 0.471 0.065 

12 50 0.385 0.53 

6 100 0.276 0.87 



3.2. Simulated genetic data under various demographic scenarios 

Next we applied our method to simulated substructured datasets generated 



by coalescent simulations under various demographic scenarios from (Gao et al. 



2011). This dataset has N = 100 markers sampled from subpopulations with 



constant population sizes of 50 individuals, and with varying M = 50 x K, 
K = l,...,5. 

In these simulations, Q is not known. But we can estimate L by using the 



smallest eigenvalue of the empirical approximation to Q based on formula (4.16). 



The observed accuracy under each scenario, shown in Tables [2]-[4j is in line with 
our results from simulations listed in Table [TJ 



Model-based approaches such as those evaluated in (Gao et al. 2011) are 
likely to outperform PCA detection for such small sample sizes, since the true sub- 
structure corresponds to that found in the underlying STRUCTURE-Yike model 



(Falush et al. 2003). However, using our method, we have no false positives in 
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any of these sets of simulations. From these simulations we find that the error 
rates are not affected by using approximate Q in the calculations instead of ex- 
act Q when we approximate L from the population data. (The values L of the 
approximated L varied considerably in the simulated 50 runs for a model, but 
L > 0.5 was associated with the correct value of K in each case.) 

Table 2 False negative (error) rates of estimates of K for 50 simulated data sets 
under model Split 



True K 
F{K < K) 



2 3 4 5 
0.0 0.14 0.80 0.98 



Table 3 False negative (error) rates of estimates of K for 50 simulated data sets 
under model Inbred 



True K 
F{K < K) 



2 3 4 5 
0.0 0.16 0.72 1.0 



Table 4 False negative (error) rates of estimates of K for 50 simulated data sets 
under model Mig 



True K 
F(K < K) 



2 3 4 5 
0.0 0.02 0.10 0.46 



3.3. Application to human population genotype data 

We next examined the distribution of eigenvalues for a dataset of human geno- 
type data. The International HapMap Project was designed to create a catalog 
of human genetic variation to find genes that affect health, disease, and indi- 
vidual responses to medications and environmental factors. We use a genome- 
wide SNP dataset made publicly available through this project as HapMap 3 
(HapMap 3, release 3, human genome build 36) which contains genotypes of 
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individuals from 11 human populations, comprised of genotype data collected 
using two platforms: the Illumina HumanlM and the Affymetrix SNP 6.0 ar- 
rays. These populations and datasets have been extensively studied previously 
(see http://hapmap.ncbi.nlm.nih.gov/publications.html.en for a list of publica- 
tions) . 

Unlike simulated data, the true substructure of the complete set of popula- 
tions is unknown. We therefore report the performance of our theoretical analysis 
on the subset of well defined subpopulations, which are believed to have clear sub- 
structure: the Yoruba, of Ibadan, Nigeria (YRI), European Americans from Utah 
(CEU), and Chinese from Denver (CEU). 

After extracting the CEU, CHB, and YRI individuals, we processed the data 



through PLINK ( |Purcell et al.[|2007[ ) with filters —filter-founders — geno 
to remove SNPs with any missing data and exclude offspring of trios, and further 
exclude non-autosomal markers. The final dataset of M = 297 individuals and 
N = 736750 markers was used for the analysis of the eigenvalues. 

As expected from mathematical theory and from the choice of very distinct 
subpopulations, the eigenvalues of matric X split into the bulk in Figure [I] that 
lies below the cutoff of 0.5, and three large eigenvalues Ai = 102.0, A2 = 14.55, 
and A3 = 7.37 that exceed the cutoff of 0.5 and give an estimate of K = 3, 
which matches our prediction for these three populations. The histogram seems 
to show some possible eigenvalues separated from the bulk, but they may as well 
correspond to various minor deviations from the model that are present in real 
data - we discuss this issue below. 
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0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 



Figure 1: The bulk of the eigenvalues from PCA of Hapmap CEU, CHB, and YRI unrelated 
parents. The largest three eigenvalues that correspond to subpopulation structure are Ai = 
102.0, A 2 = 14.55, A 3 = 7.37 and are not shown. 



3-4- Practical comments on using theory 

Theorem [I] and the cutoff of 0.5 should be used in real data only after visual 
control for the shape of the bulk and for the separation from the bulk of the 
largest eigenvalues. Simulations indicate that when the theory is applicable the 
histogram of the bulk is located to the left of 0.5 (or 1 when F = 1) and its shape 



resembles the Marchenko-Pastur law ((Bai and Silverstein, |2010 Chapter 3)) of 
the same ratio N/M. For a typical large value of N/M > 50, this shape looks 
similar to a semi-ellipse. 

The shape of the bulk is affected by relationships between the individuals. 
This is best illustrated when the offspring of trios are included in the analysis 
of the three populations HapMap dataset. Then the shape of the bulk does 
not follow Marchenko-Pastur law, and instead resembles a shape reproduced by 
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repeating a large number of individuals, see Figure [2} 




M = 700; N = 10000; N/M = 14.2 M = 405; N = 660847; N/M = 1631.7 

Figure 2: Including related individuals perturbs the expected shape of the bulk of the eigenval- 
ues. Left: The bulk of the eigenvalues from PCA using data generated via binomial simulation, 
where 14% of the individuals have been repeated. Right: The bulk of the eigenvalues for PCA 
of three populations of HapMap (CEU, YRI, and CHB) including trios - 297 parents and their 
related 108 offspring. Both simulated data and empirical genotype data show that inclusion of 
related individuals results in a multi-modal distribution of the bulk of the eigenvalues, arising 
from the non-random correlations of individuals. 

The shape of the bulk for the full HapMap data set seems to exhibit additional 
deviation from the expected shape, extends well beyond 0.5, and the distribution 
of the bulk might not be unimodal. These deviations cannot be attributed to 
linkage disequilibrium as they do not disappear after LD pruning. Instead, we 
expect these deviations from the expected shape correspond to complex substruc- 
ture and relationships among individuals with insufficient power to be detected 
with so few individuals. 
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Figure 3: The bulk of the eigenvalues from PCA of all 11 populations in Hapmap unrelated 
parents. Nonautosomal markers with M = 924, N = 422253. The six largest eigenvalues 
Ai = 335.9, A 2 = 37.4, A 3 = 16.7, A 4 = 2.5, A 5 = 2.1, A 6 = 1.7 are not shown 



3.5. Conclusion 

Eigenvalues of the uncentered covariance matrix CC' larger than the theo- 



retical threshold (2.1), when combined with overall histogram of eigenvalues, are 
a consistent indicator of the presence of subpopulations in the data. We demon- 
strate in two proof-of-principle simulations that we are able to obtain evidence 
of population structure when the number of individuals is large enough. Our 
estimate is conservative, and we never encounter false positive evidence of popu- 
lation structure. That is, with independent markers, we never obtain evidence of 
more populations than present in the simulations. We encounter a loss of power 
(false negatives) in small simulated data sets. The accuracy of estimating the 
number of subpopulations K depends on the number of individuals M in the 
sample and is not improved significantly by increasing solely the number N of 
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available markers. 
4. Theory 

Our goal in this section is to point out aspects of population structure that 
could be responsible for the observed phenomenon that the set of eigenvalues of 
CC splits into two groups: a small set of K large eigenvalues, and a large set 
of M — K of small eigenvalues. Since CC is a random matrix, we want this 
split to occur with overwhelming probability. This task requires a more detailed 
specification of the model. While the statements become more cumbersome, the 
gain is a clear indication of how different aspects of the model influence our ability 
to discover the subpopulation structure, and under what circumstances it may 
remain hidden. 

We assume that our genetic markers are biallelic and that we have N poly- 
morphic markers. We assume that we have diploid individuals from K subpopu- 
lations, and that each subpopulation r is described by its own allelic probability 
measures n r> i, . . . , 7r rj 7v- That is, we assume that the data for the j-th marker 
of an individual from the r-th subpopulation take values 0, 1,2 with respective 
probabilities Tr r j{0), 7i>j(l), 7i>j(2), which correspond to the frequencies of the 
genotypes in this population. Probabilities ir r j are a two-parameter family which 
we will parameterize by the half-mean: p r (j) = vr r j(2) + ^7r rj (l) (allelic proba- 
bilities) and by the coefficients 

1 _ F = 

rj 2 Pr (j)(i- Pr (j)y 

so that the probabilities for the j-th marker of an individual from the r-th sub- 
population take values 0, 1, 2 with respective probabilities. 
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7iy,,(0) 



7T r j(l) 



7T r j(2) 



(1 -Pr(j)) +F rd p r (j)(l -Pr(j)), 
2 Pr (j)(l- Pr (j))(l-F r , j ), 
PrU? +F rjPr (j)(l-p r (j))). 



We write 



sup F r j 

r,j 



(4.1) 
(4.2) 
(4.3) 

(4.4) 



for the least upper bound on these parameters. We always have the bound 
< F < 1, and F = for a randomly mating population in Hardy- Weinberg 
equilibrium. The conservative value F = 1 is appropriate to use when the data 
depart significantly from Hardy- Weinberg equilibrium. 

One possible interpretation of F is that it describes the least upper bound 
for the probability of mis-counting a heterozygote when processing the biological 
data; a probabilistic interpretation is that F is the so called ^-mixing measure of 



dependence (Bradley, 2005) between the alleles, without attributing the causes 



of dependence. When F r does not depend on r, j another interpretation of their 



common value F is that this is an inbreeding coefficient (Wright, 1943) 



The expected value of the law (|4.1[)-(4.3 ) is 2p r (j) and the variance is 2(1 + 



Fr,j)Pr(m ~ Pr(j)) < (l + F)/2. 

In diffusion models, allelic probabilities p r (l), . . . ,p r (N) are considered ran- 
dom and are then adequately described by their density function ip r (x), x G [0, 1], 



see e.g. (Kimura, 1964). We shall call ip r (x) the site frequency spectrum for the 
r-th population. 

Analysis of multi-population demographic models may be based on the joint 



site frequency spectra (Gutenkunst et al. 2009 Xie, 2010 Bustamante et al. 



2001). For our analysis we do not really need the multi- variable joint frequency 



spectra; we only need to model each pair of populations r, s in terms of its pairwise 
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site frequency spectrum which is a probability measure ^ TyS {dx,dy) on the unit 
square [0,1] x [0,1]. In some situations, $? r ^ s (dx,dy) can be described by the 
corresponding density tp r>s (x, y), and we adopt this for notational simplicity. Ref. 



(Gutenkunst et al. 2009) implements numerical approximation of joint frequency 
spectra for up to three simultaneous populations. 

In practice, due to ascertainment bias the joint site frequency spectrum 
^i,...,k(xi, ■ ■ ■ ,Xk), or even the pairwise site frequency spectra ip riS (x,y) are of- 
ten difficult to estimate from the genetic data. We describe population structure 
in SNP genotyping arrays by the ascertainment biased spectrum (p rjS (x,y) mod- 
eling each pair r, s of the populations. Consequently, we assume that allelic 
probabilities p r (l), . . . ,p r (N) are random and are adequately described by their 
density function ip r (x), and that for the j'-th locus each pair of allelic probabilities 
(Pr(j)iPs(j)) follow the same bivariate distribution with density ip r)S (x,y). We 
introduce the pairwise population moments: 

m T ^ r = I x 2 f r (x)dx, (4-5) 



J 

and for r ^ s, 

m r>s = / / xyip rtS (x,y)dxdy. (4.6) 
Jo Jo 

The K x K array of deterministic numbers m r)S together with the relative sub- 
population sizes are the parameters that enter our mathematical analysis. 

Assumption 4.1. For any pair of subpopulations labeled by r,s £ {1, . . . , K}, 
we assume that 

1 N 



3=1 



In particular, we assume that limiv->oo Y^,j=iPrU) = m r,r exists. 



A natural situation where the limit (4.7) exists is when the allelic probabil- 
ities have been sampled (independently, or with weak correlations) from a joint 
ascertainment biased site frequency spectrum ^...^(xi, • • • > £fc)- 



19 



The allelic probabilities p r (l), . . . ,p r (N) for the r-th subpopulation are of 
course unknown but represent the true underlying frequency of the alleles in 
the r-population and they are fixed when sampling the individuals from the 
population. 

For 1 < r < K consider infinite sequences {p r (j) : j G N} of numbers in [0, 1] 
such that the limits (4.7) exist. We also fix a sequence of constants c\, . . . , ck > 
that add up to 1. 

Consider a K x K (deterministic) symmetric positive matrix Q with entries 



[Q]r 



(4.8) 



where m r ^ s are given by (4.5) and (4.6). Next we assume that we have sequences 

Mi(iV), . . .,M K (N) -> oo of integers such that with M(N) = Mi(iV) H h 

M K (N), we have M(N)/N -> c for some c > 0, and M r (N)/M(N) c r > 
as N — > oo. We assume that all the entries of matrix C = Cat are independent, 
conditionally on {p r< j}- 

Since the singular values of C do not depend on the order of the rows, for 
mathematical analysis we assume that the individuals were arranged by popula- 



tion, so that C has block structure (4.9). 



Ci 

c 2 

C K 



(4.9) 



where C r is the M r (N) x N sub-matrix representing the data for the individuals 
from the r-th subpopulation. For r = 1, . . . , K , i = 1, . . . , M r , j = 1, . . . , N, we 



assume that the distribution of entry [C r ]jj is given by (4.1)-(4.3). 

For the almost sure results we also need to assume that C comes from an 
infinite matrix. More specifically, we assume that conditionally on {p r ,j}, each of 
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the K blocks C r arises as an upper-left M r x N corner of an infinite matrix with 



independent entries that have distribution (4.1)— (4.3). We also need to make a 



technical assumption that the number of individuals from the r-th subpopulation 
increases with N, that is M r (N + 1) > M r (N). 

Since the eigenvalues of CC are large, it is more convenient to consider the 
normalized M x M sample covariance matrices 

1 



X 



N 



CC. 



(4.10) 



Of course, we assume N > M > K . 



Theorem 1. Let \\ > X 2 > Xk be the (deterministic) eigenvalues of matrix Q 



defined by p~8| ) and let A x (iV) > A 2 (iV) > • • • > A M{N) (N) > be the (random) 



eigenvalues of the M{N) x M(N) sample covariance matrix X^r from (4.10). 
Then, as N — > 00, with probability one 



Ak+i(N) < (l + F)/2 



(4.11) 



and 



+ _L] [A 1 (N),A 2 (N),...,A K (N)]^A[X 1 ,X 2 ,...,X K ] (4.12) 



y/M(N) VN 



Remark 4.1. Under Hardy- Weinberg equilibrium F = 0, so (4.11) takes form 



A^+i(A0 < 1/2. 



(4.13) 



However, usually the value of F is not known. In such cases, since < F < 1, 



while Ak(N) — > 00 as N — > 00 is much larger than 1, formula (4.11) may be 
replaced by 

A^ +1 (A0 < 1. (4.14) 



When Q has full rank K formula (4.12) indicates that for large M,N the 



first K largest empirical eigenvalues of X^r are large and can be estimated from 
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the eigenvalues of Q. Formula (4.11) shows that the remaining eigenvalues are 
relatively small, and are of the order 1/M smaller than the largest K eigenvalues 
(here we assume the standard setting in SNP data where M < N). 
Remark 4.2. We have much less information for the case when Q has rank K' < K 
with positive eigenvalues Ai > A2 > Xk> > but Xr'+i = • • • = A^ = 0. In this 



case the bulk is still concentrated below 1/2, as (4.11) shows that A^+i < 1/2 



From (4.12) we deduce that the eigenvalues Ai, . . . , Ak> diverge to infinity. But 



we do not have any information about Ak'+i, . . . , Ak which in this case are only 
known to be of order smaller than , r -M N r — s ^ ; we do not have any mathematical 



results about their relation to the "cutoff" (1 + F)/2. 

4-1. Proof of Theorem^ 

Let P k £ R N denote the column vector [p k (l), . . .,p k (N)}'. Let E k £ R M be 
the column vector of ones at the rows corresponding to the k-th block of C, i.e. 

with [E k ] r = 1 if Mi H h M k _i < r < M\ H V M k and otherwise. Then 

E(C) = lYLh=\ E k p 'k' and we write 

K 

C N = V + 2J2E k PL (4-15) 

k=l 

where V is an M x N matrix of centered independent uniformly bounded random 



variables. Note that E(C) factors as in (Engelhardt and Stephens, 2010, Eqtn. 
(1)), but we keep an additional term in analyzing ( 4.15| ). 

Let Ai > • • • > Xk > be all eigenvalues of Q. (Recall that N > M = 
M(N) > K.) 

Lemma 2. Let o~i(N) > o-2(N) > ■■■ > o~k{N) > be the singular values of 

Y,k=i E k p k - Then f or 1 <3< K , limTV^oo NM(N) = X r 

Proof. Consider the sequence of K x K matrices B^r with entries 

[BN)r,s = ^pJ2Pr(3)PsU)- (4-16) 
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(Recall that M r = M r {N) is a function of N.) Since B^v —> Q entrywise, its 
eigenvalues Ai(Bjv), . . . , A^Btv) converge to X±, . . . , Xk- However, Aj(Bjv) = 
-jfjj- for all j £ {1, . . . , K}. To see this, denote Uk = Jj^^ k- Then 

(j2 E k P k) (Y, E k P l) =NMY / l^N]r, s U r U' s 

\k=l / \fc=l / r,s=l 

Let now x = [x%, . . . , xk]' be an eigenvector corresponding to eigenvalue A of 
Bat. Then y = Ylk=l x ^k £ K M is an eigenvector of Y^s=i\^N]r,sU r U' s with 
the same A. So NMX is the square of a singular value of Ylk=i E kPl- Note 
that orthogonal vectors x correspond to orthogonal y, so this procedure exhausts 
the first K eigenvalues, even if they are repeated or 0; the remaining M — K 
eigenvalues are zero . □ 

Lemma 3. With probability one, as M/N — > c > 0, 

v 1 iiy I, / A + F 
lim sup = Vjv < 

7V->oc VM + VN 



Proof. Recall that 

||V|| = su P {||Vx|| r m : \\x\\ RN = 1} = VAi(W') 

is a convex function of the entries of V. 

We apply a non-i.i.d. version of ( |Yin et al. , 1988, Theorem 3.1), as extended 



in ( |Couillet et al.[ |2010| Theorem 3) to the matrix Bjy = (V + Z)/(VM + 
V~N), where Z is the matrix of independent two- valued random variables that 
compensate for the differences in the variances of entries of V. That is, since 

E ([V] 2 rJ ) = 2(1 + F r j) PrJ (l- Pr>j ) < (1 + F)/2, (4.17) 

we request that E([V] r j + [Z] rJ ) 2 = (1 + F)/2. 

Note that here we use a discrete distribution with just two values [Z] r j = ±a r j 



such that afj<(l + F)/2<l. So assumption (2) of (Couillet et al. 



2010 
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Theorem 3) holds with a constant X, and assumption (3) of (Couillet et al. 
Theorem 3) holds with ip(x) = x 2 . 



2010 



We also need to make sure that assumption (1) of (Couillet et al. 2010 



Theorem 3) is satisfied, that is, we need to apply this theorem to a matrix that 
comes as an upper-left corner of an infinite matrix with independent entries. 

To do so, we permute the rows of matrix C so that is arises as an M{N) x N 
sub-matrix from the infinite matrix which is described as follows. Recall that we 
assume that each matrix C r comes as the upper-left corner from an infinite matrix 
of independent entries that correspond to population r. We now use these infinite 
matrices to construct a single infinite matrix, from which C arises by permuting 
the rows of the M(n) x N upper- left sub- matrix. We start with N = 1 and 
Mi(l) (infinite) rows from population 1, followed by M 2 {X) rows from population 
2, through Mr-(I) rows from population K. Then in the second "pass" we add 
Mi(2) — Mi(l) rows from population 1, M2(2) — M2(l) rows from population 
2, etc. In the iV-th pass, we add M\(N) — M\(N — 1) rows from population 
1, M 2 (N) - M 2 (N - 1) rows from population 2, though M K (N) - M K (N - 1) 
rows from population K. Notice that since M r {N) are increasing in N, and 
M{N) —7- oo, this construction does not stop and gives an infinite matrix. It is 
clear, that after the iV-th pass, the resulting M(N) x N sub-matrix will have 
exactly M r {N) rows from population r, so that its eigenvalues are the same as 
those of matrix C. 



Then, from (Couillet et al. 2010 Theorem 3) we infer that with probability 
one, 

,'l + F 

lim — t= :=\\\~n + Zn\ 



jv^oo y/N + VM V 2 

Therefore, denoting by /z(cZZ) the integral with respect to the law of the entire 
infinite sequence Zij, by Jensen's inequality and Fatou's lemma we have 
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limsup ||Vjv| 

iV— >oo 



s/M + VN) = limsup || f(y N + Z)fi(dZ)\\/(VM + VN) 



< 



< 



limsup f \\V N + Z\\ti(dZ)/(VM + VN) 
limsup || V N + Z||/ (Vm + Vn)) n(dZ) 



1 + F 



with probability one. 



□ 



Proof of Theorem [7} This part of the proof is similar to ( Silverstein , 1994| ) . In 



(4.15 ), we consider Cat as a small perturbation of the finite rank matrix 2 X^fc=i ^kP'k- 
Denote by Ti(iV) > • • • > tk(N) the largest (deterministic) singular values of 

K 



N + VM 



k=l 



and set tj(N) = for j > K. Then it is known, see e.g. (Horn and Johnson 
Theorem 3.3.16(c)), that the singular values of 



1994 



i 



decreasing order, differ by at most 



i 



/N+VM 



C, written in 



N+VM 

Vjv|| from the corresponding singular 



values Tj(N), written in decreasing order. 

Since Tj(N) = for j > K, from Lemma [3] we get (4.11). 
Lemma [2] shows that 



1 1 

+ 



Tj (N)(VM + s/N) /VMN = Tj (N) 
for 1 < j < K, so 



2 J\ 



l l 

+ 



has the same limit, and by taking squares of both sides we get (4.12) 



□ 
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